We recommend visiting the web application to explore data. You must be on the NIH VPN to access the app here:
Alternatively, if you choose to run SEQUIN locally, you will first need a GitHub account. You can clone the repository with the command below:
git clone https://github.com/ncats/sctl-rshiny-complex.git
The SEQUIN web app is most easily run using the Docker container. To install Docker, go to the following link and follow the directions.
On Mac, Windows or Linux, you will need to create an Rprofile.site file. In a text editor such as Vim, type the following:
local({
options(shiny.port = 3838, shiny.host = "0.0.0.0")
})
This specifies the host and port on which the app is served locally.
Navigate to the cloned GitHub repository. The Dockerfile and the install.R file should be located one directory above the directory named app/. After Docker is installed on your machine, navigate to the location of the Dockerfile and build the image by running the following command:
docker build -t sequin_app .
The build process takes a couple of hours, which is why we recommend accessing the app while on the NCATS VPN. After the build is complete, you can start the app by typing the following command in the terminal:
docker run -p 3838:3838 sequin_app
Then open your browser (we recommend Chrome) and go to the following address:
http://localhost:3838
It will take a few minutes to load the app as there are several libraries loaded initially.
When you land at https://sequin-ci.ncats.io/, a pop-up box will include some background information about the web application as shown below.
Select either single-cell RNA-Seq or Bulk RNA-seq data.
Under the Existing datasets header, choose one or more existing data sets. Click on a row to select a data set. Click again to de-select.
Once the data set(s) are selected, you can subset the data to target specific factors (e.g. specific samples, condition, Seurat clusters, etc.).
Click Subset data and a pop-up modal will appear as shown below. Here, select specific groups based on the factor specified in Select factor.
To select or clear all groups at once, click Select all or Deselect all as shown below.
If you want to load the entire dataset rather than subsetting the data, just click on the Load full dataset(s) button. If you have subset the data and want to load it, click on the Load selected sample(s) button.
There are several options on the scRNA-Seq side including: Type of clustering (explained in detail below), setting the minimum detection rate for all genes (Detection rate threshold), excluding rRNA, mitochondrial and pseudo genes (Exclude RNA/MT/pseudo genes) and down sampling the total number of cells for an analysis (Downsample cells).
Under Type of clustering, you can choose Use pre-assigned clusters in metadata, which will utilize the pre-computed clusters in the metadata. If you choose Run multiple resolutions using Seurat, Seurat clustering will be run at multiple resolutions from 0.4 to 2.8 in steps of 0.4, which you can view in later areas of the application. This is useful for comparing results from different resolutions.
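As a rough illustration (not SEQUIN's own code, which is an R Shiny app), the resolution grid described above can be sketched in Python. The variable names are ours.

```python
# Resolutions from 0.4 to 2.8 in steps of 0.4, as described above.
# Computing each value from an integer index avoids the floating-point
# drift that repeated addition of 0.4 would introduce.
START, STOP, STEP = 0.4, 2.8, 0.4

n_steps = round((STOP - START) / STEP)  # number of steps between START and STOP
resolutions = [round(START + i * STEP, 1) for i in range(n_steps + 1)]

print(resolutions)
```

This yields seven resolutions, so the later tabs have seven alternative clusterings to compare.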
For particularly large scRNA-Seq datasets (over 10,000 cells), we default to down sampling the data set to 10,000 cells. For down sampling, you’ll need to select the maximum number of cells to keep under Max cells. Random sampling is initialized with a random seed, so if you would like to draw a different random sample, please change the value in Random seed. If you would like to retain the entire data set, please un-check the box for down sampling.
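The effect of Max cells and Random seed can be pictured with a minimal Python sketch. It is an illustration only: the function name, the default seed of 1 and the cell IDs are hypothetical, and the app's actual sampling routine may differ.

```python
import random

def downsample_cells(cell_ids, max_cells=10_000, seed=1):
    """Return at most max_cells cell IDs, sampled without replacement."""
    cell_ids = list(cell_ids)
    if len(cell_ids) <= max_cells:
        return cell_ids  # nothing to do for small data sets
    rng = random.Random(seed)  # the same seed always gives the same subset
    return rng.sample(cell_ids, max_cells)

cells = [f"cell_{i}" for i in range(25_000)]  # made-up cell IDs
subset_a = downsample_cells(cells, max_cells=10_000, seed=1)
subset_b = downsample_cells(cells, max_cells=10_000, seed=1)
subset_c = downsample_cells(cells, max_cells=10_000, seed=2)
```

Here subset_a and subset_b are identical because they share a seed, while subset_c is a different random draw of the same size.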
We have included advanced options on the scRNA-Seq side including setting the Detection rate threshold and PCA cumulative variance %. With the Detection rate threshold, DGE tests only genes detected in a minimum fraction of cells. When analyzing scRNA-Seq data, Seurat clusters cells based on their PCA scores. The top PCs represent a robust compression of the given dataset. The PCA cumulative variance percentage sets a cap on the total cumulative variance and therefore on the number of PCs that encompass that variance. If you do not set this, the default is 75%.
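One plausible reading of the PCA cumulative variance cap is "keep the smallest number of leading PCs whose cumulative variance reaches the cap". The Python sketch below illustrates that reading only. The function name and the variance values are made up, and the app may apply the cap differently.

```python
def n_pcs_for_cap(variance_pct, cap=75.0):
    """Number of leading PCs needed for cumulative variance to reach cap (%)."""
    total = 0.0
    for n, pct in enumerate(variance_pct, start=1):
        total += pct
        if total >= cap:
            return n
    return len(variance_pct)  # cap never reached: keep every PC

# Made-up percentage of variance explained by each PC, largest first.
variance_pct = [30.0, 20.0, 15.0, 10.0, 8.0, 5.0]
n_pcs = n_pcs_for_cap(variance_pct, cap=75.0)
```

With these numbers the cumulative variance is 30, 50, 65 and then 75, so four PCs are kept.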
If you have visited the web application before and have created your own clusters under the Merge clusters tab or grouped cells under Group cells by gene expression, the updated metadata that you pushed to the RDS will appear below Select dataset(s) from existing experiment under the header Option: Select user-updated metadata. As shown below, under the column Experiment, the name of the existing dataset will be listed. Under the column Username, the username you created will be listed, and the date that the data was pushed to the RDS is listed under Date. Under the column Table name, we automatically concatenate the existing dataset with the username and a number. If you create multiple metadata tables, we increment this number so you can distinguish these metadata tables. Lastly, under the column Note, we include any Comment you added under Merge clusters or Group cells by gene expression. See those sections for more information. Click on the checkbox to use your user-updated metadata file.
There are also several options on the bulk RNA-Seq side including: Filtering above a minimum row sum of counts (Min. total counts per gene), the type of transformation to perform on the data (Count transformation method) and excluding rRNA, mitochondrial and pseudo genes (Exclude RNA/MT/pseudo genes).
Min. total counts per gene: this refers to the sum of all counts across samples for a given row (gene). If you set this value to 10, then all genes with row sums less than 10 will be filtered from the data.
Count transformation method: we include the following options:
Another option that is available on the bulk RNA-Seq side is the option to remove an outlier sample.
After clicking Subset data, an additional option will be available to Identify outliers; click on this button. An additional pop-up will appear; click the button that says Run PCA. This will load a PCA plot and allow you to select specific samples to remove.
The sample you select in the PCA plot is shown under Current selection: (displayed below as 2.). Click on +Add sample. This will remove this outlier sample from your analysis.
The samples you have added are listed under Selected samples: (displayed below as 1.). Finally, to submit outliers, click on Submit outliers as shown below.
You will be returned to the Subset data pop-up. You will see all samples selected except the ones you removed as outlier(s) (see below).
You can also load your own metadata and count matrices. Simply click on the Custom dataset tab to upload your own data.
Click the Browse button to upload both the count and metadata matrices, separately (labeled below as 2. and 3.). Please note, it may take a few seconds for the data to load. Please make sure you see the message “Upload Complete” before proceeding (labeled below as 4. and 5.).
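To combine your data with existing experiments, click + Add existing experiments and a pop-up will appear. Click on the checkbox to add an experiment as shown below. To remove them again, click -Clear existing experiments as shown below.
To download the count data, click Download raw counts or Download normalized data.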
To download the metadata, click Download metadata as shown below.
In the box and whisker plot, for single-cell RNA-seq, cells are grouped based on orig.ident (original cell identity classes). For bulk RNA-seq, samples may be grouped based on one or more factors. Shown below is an example of a bulk RNA-Seq box and whisker plot of the count data grouped by the factor condition. To download plots, click on Download static plot (PDF) or Download static plot (PNG).
The histogram shows the average frequency of transformed normalized counts for the selected experimental factor. For single-cell RNA-seq, cells are grouped based on orig.ident (original cell identity classes). For bulk RNA-seq, samples may be grouped based on one or more factors. Shown below is an example of a bulk RNA-Seq histogram of the count data grouped by the factor condition. To download plots, click on Download static plot (PDF) or Download static plot (PNG).
The barplot shows the total read counts for the selected factor. For single-cell RNA-seq, cells are grouped based on orig.ident (original cell identity classes). For bulk RNA-seq, samples may be grouped based on one or more factors. Shown below is an example of a bulk RNA-Seq barplot of the total reads grouped by the factor condition. To download plots, click on Download static plot (PDF) or Download static plot (PNG).
Click on the tab Discovery-driven analyses/Correlation to explore the correlation between samples, distance matrices (bulk RNA-Seq) and clustering using dimensional reduction plots.
To build the correlation matrix, select All under Genes and input the total number of samples or cells under Cells. Please be aware that if the dataset contains over 10,000 cells and thousands of genes, the correlation matrix will take a long time to build and visualize.
To label the heatmap, check Show column/row labels (may not align correctly for large heatmaps) as shown below.
Clustering of samples/cells can be viewed in dimensional reduction plots under Discovery-driven analyses/Clustering. Details and optional parameters are described below.
Select the factor used to group samples/cells under Grouping factor as shown below. For bulk RNA-Seq data, only PCA plots are available for visualization.
Select the dimensional reduction method under Method as shown below.
Differential gene expression (DGE) and related downstream analyses are accessed at the DGE analysis page. Different visualizations and analyses are available for bulk RNA-seq and scRNA-seq as detailed below.
To run DGE analysis, several options are available including: the experimental design, the factor and grouping variables (if required by the experimental design), the DGE method, the option to batch correct, the adjusted p-value cut-off and the minimum fold-change. These options are detailed below.
Two group comparisons: This design tests for differentially expressed genes between two experimental groups within a given factor from the metadata. Only one experimental factor can be selected, and two groups within that factor are compared. For example, in the image below, the selected Factor is treatment, with a comparison between Group 1, treated, and Group 2, untreated. As is shown under Linear model:, treatment is the only factor used to predict differential gene expression. For Two group comparisons we also have the option to compare one sample to the rest of the samples as shown below.
Multiple factor comparisons (factorial): DE testing is performed between levels from different factors. For example, if Factor A is treatment with Factor A group 1 untreated and Factor A group 2 treated, while Factor B is day with Factor B group 1 day 1 and Factor B group 2 day 20, DE testing will be performed between untreated_day1 and treated_day20. As is shown under Linear model:, the independent variable used to predict differential gene expression will be the combination of treatment and day, untreated_day1 vs treated_day20, only.
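The idea of combining two factors into a single grouping variable can be sketched as follows. This is a Python illustration with hypothetical sample names and a helper of our own, not the app's R code.

```python
def combine_factors(samples, factor_a, factor_b):
    """Map each sample to a combined '<A level>_<B level>' group label."""
    return {name: f"{meta[factor_a]}_{meta[factor_b]}"
            for name, meta in samples.items()}

# Hypothetical metadata for four samples.
samples = {
    "s1": {"treatment": "untreated", "day": "day1"},
    "s2": {"treatment": "untreated", "day": "day20"},
    "s3": {"treatment": "treated", "day": "day1"},
    "s4": {"treatment": "treated", "day": "day20"},
}
groups = combine_factors(samples, "treatment", "day")

# Only the two selected combinations enter the comparison.
group_1 = [s for s, g in groups.items() if g == "untreated_day1"]
group_2 = [s for s, g in groups.items() if g == "treated_day20"]
```

Samples s2 and s3 carry other combinations, so they take no part in the untreated_day1 vs treated_day20 contrast.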
Classical interaction design: DE is performed between levels of each factor chosen relative to a reference. This experimental method tests between all possible combinations of each non-reference level vs. the corresponding reference. For example, if Factor A is “day” with Factor A reference “day 1” and Factor B is “treatment” with Factor B reference “untreated”, the factors are day and treatment and the reference levels are day 1 and untreated. As is shown under Linear model:, day, treatment and the interaction of these two independent variables are used to predict differential gene expression. The contrasts will include: day 20 vs day 1, treated vs untreated, the interaction term of day 20 and treated, and an intercept term.
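To make the terms of the interaction model concrete, the sketch below builds a design matrix by hand using dummy (treatment) coding, the default coding R's model.matrix produces for a formula such as ~ day + treatment + day:treatment. That SEQUIN uses exactly this coding is an assumption, and the helper is ours.

```python
def design_row(day, treatment, day_ref="day1", trt_ref="untreated"):
    """One design-matrix row: intercept, day effect, treatment effect, interaction."""
    d = 0 if day == day_ref else 1        # 1 for the non-reference day
    t = 0 if treatment == trt_ref else 1  # 1 for the non-reference treatment
    return [1, d, t, d * t]               # interaction is 1 only when both are 1

columns = ["intercept", "day20", "treated", "day20:treated"]
samples = [
    ("day1", "untreated"),
    ("day20", "untreated"),
    ("day1", "treated"),
    ("day20", "treated"),
]
design = [design_row(day, trt) for day, trt in samples]

for (day, trt), row in zip(samples, design):
    print(day, trt, dict(zip(columns, row)))
```

Only the day 20, treated sample has a 1 in the interaction column, which is why that coefficient captures the effect beyond the two main effects.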
If the Blocking factor is day with a Blocking factor reference of day 1, day 1 will be removed from the downstream DE analysis. If the Treatment factor is treatment with a Treatment factor reference of untreated, then the subsequent contrasts will include day 20, treated vs untreated and an intercept as predictors of differential gene expression. As is shown under Linear model:, day and treatment are the two independent variables used to predict differential gene expression while removing day 1 as a factor.
In this example, the Main effect factor is cell_line with the Main effect reference H9, Group 1 H9 and Group 2 LiPSC.GR1.1. As is shown under Linear model:, cell_line is the only independent variable used to predict differential gene expression relative to cell_line H9.
As with Main effects, the Main effect factor is the only factor used to predict differential gene expression; however, a Grouping factor is used to perform DE testing within the selected factor. In the example below, the Main effect factor is cell_line with Main effect reference H9, Group 1 LiPSC.GR1.1 and Group 2 H9. The Grouping factor is treatment with Grouping factor level Y27. The final contrast will be LiPSC.GR1.1 vs H9 within the Y27 treatment group. As is shown in the Linear model:, cell_line is the main independent variable used to predict differential gene expression while only comparing within the Y27 treatment.
There are three methods to perform differential gene expression analysis under DGE method: DESeq2, edgeR and limma-voom.
Set the adjusted p-value cut-off under Adj. p-value and the minimum log fold-change under Min. fold change for differential gene expression.
Click Submit to run DE testing.
A DGE regulation table is generated detailing the contrasts (Comparison), whether the gene is up- or downregulated in a particular comparison (Regulation) and the total number of gene IDs that are up- or downregulated (IDs). In the example below, there are 130 gene IDs upregulated in treated vs untreated and 108 gene IDs downregulated in treated vs untreated.
In addition to the DGE regulation table, we also generate a barplot showing the total number of genes up- and downregulated in DE testing.
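The counts in a regulation table of this kind can be reproduced from per-gene results with a few lines of Python. The sketch below uses made-up genes and treats the fold-change threshold as a log2 value. Both the threshold semantics and the function name are assumptions for illustration.

```python
from collections import Counter

def regulation(log2_fc, padj, min_log2_fc=1.0, padj_cutoff=0.05):
    """Label one gene as 'up', 'down' or 'ns' (not significant)."""
    if padj >= padj_cutoff or abs(log2_fc) < min_log2_fc:
        return "ns"
    return "up" if log2_fc > 0 else "down"

# Made-up results for one contrast: (gene, log2 fold-change, adjusted p-value)
results = [
    ("GENE_A", 2.5, 0.001),
    ("GENE_B", -1.8, 0.010),
    ("GENE_C", 0.3, 0.0001),  # significant but below the fold-change threshold
    ("GENE_D", 3.0, 0.200),   # large change but not significant
    ("GENE_E", 1.2, 0.040),
]
counts = Counter(regulation(lfc, padj) for _, lfc, padj in results)
```

A gene must pass both cut-offs to be counted, so GENE_C and GENE_D are excluded even though each passes one of them.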
After DE testing is completed, you can visualize the significant up- and down-regulated genes using a volcano or MA plot. If there are multiple contrasts to visualize, you can change the contrast under the Contrast drop-down menu (circled in red below). The Volcano plot shows the -log10(p-value) vs log2(fold-change). We also highlight the down- and up-regulated genes as shown in the DGE regulation table. You can hover over each data point to get more information about the gene that was differentially expressed.
Similarly, you can view a Bland-Altman (MA) plot with the same visualizations, except that it plots log2(fold-change) vs log10(baseMean).
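The two plots differ only in how each gene is mapped to coordinates. A minimal Python sketch of the transformations named above follows. The function names are ours, and the axis assignment follows the descriptions in this tutorial.

```python
import math

def volcano_point(log2_fc, p_value):
    """(x, y) for a volcano plot: log2 fold-change vs -log10(p-value)."""
    return (log2_fc, -math.log10(p_value))  # requires p_value > 0

def ma_point(base_mean, log2_fc):
    """(x, y) for an MA plot: log10(baseMean) vs log2 fold-change."""
    return (math.log10(base_mean), log2_fc)  # requires base_mean > 0

vx, vy = volcano_point(log2_fc=2.0, p_value=0.001)
mx, my = ma_point(base_mean=100.0, log2_fc=2.0)
```

Smaller p-values therefore sit higher on the volcano plot, while more highly expressed genes sit further right on the MA plot.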
The DGE table contains a list of the significantly differentially expressed genes based on the submitted Min. fold change and the Adj. p-val cutoff. The columns are described below:
Contrast
There are multiple parameters for gene set enrichment (GSE) that include: input gene list, contrast, gene list filtering, use top n genes and selection of EnrichR libraries.
Input gene list: you can select DGE filtered, which uses the identified differentially expressed genes. Included is an option to compare a given sample to the rest of the samples.
Custom allows you to provide genes manually. When providing genes, please use HGNC symbols separated by commas (“,”). Provide one or more genes for analysis.
Under Contrast, you may select the contrast based on the Experimental design.
Under Gene list filtering, you may select all significant differentially expressed genes (All DE genes), only those genes significantly upregulated in the first group as shown below (Up in D1) or only those genes significantly upregulated in the second group as shown below (Up in D20). With Use top n genes, we perform GSE on the top 100 significantly differentially expressed genes identified in DE testing. Alternatively, you can select Use all genes passing filters.
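Selecting the top n genes amounts to ranking the DE results and keeping the first n. The Python sketch below ranks by adjusted p-value, which is an assumption, since the ranking criterion is not stated here. The gene names and the helper are hypothetical.

```python
def top_n_genes(dge_rows, n=100):
    """Return the n gene symbols with the smallest adjusted p-values."""
    ranked = sorted(dge_rows, key=lambda row: row["padj"])
    return [row["gene"] for row in ranked[:n]]

# Made-up DE results.
dge_rows = [
    {"gene": "GENE_A", "padj": 0.03},
    {"gene": "GENE_B", "padj": 0.0001},
    {"gene": "GENE_C", "padj": 0.002},
]
top_two = top_n_genes(dge_rows, n=2)
```

If fewer than n genes pass the filters, the function simply returns all of them.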
Under Data size:, the total genes used to perform GSE analysis will be listed.
Click Run GSE. After GSE is run, a table is produced providing the following information:
There are multiple parameters for visualizing a heatmap of normalized, scaled gene expression values that are similar to GSE, including: Input gene list, Contrast, Gene list filtering and Use top n genes. In addition, you have the option to change the coloring scheme in the heatmap using Color palette, select the factor from the metadata to color samples in Choose factor(s) for labelling, and enable Cluster genes and Cluster samples.
The coloring scheme is set under Color palette and includes: default (red-white-blue), red-blue (darker red-white-blue), viridis (blue-green-yellow) and green-yellow-red.
You can select Use samples from contrast only by checking the box. This will only show samples from the selected comparison under Contrast (in this example treated vs untreated).
If you check Cluster genes and Cluster samples, dendrograms will cluster genes and samples, respectively, based on the hclust clustering algorithm and will appear on your heatmap.
If you check Scale genes, the expression data are scaled via linear transformation (mean equal to 0 and variance equal to 1). The purpose behind scaling is to ensure that highly-expressed genes do not dominate downstream analyses.
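Scaling a gene to mean 0 and variance 1 is a z-score transformation applied to each gene across samples. The Python sketch below uses the sample standard deviation (n - 1 denominator), as R's scale() does. Whether the app scales in exactly this way is an assumption.

```python
from statistics import mean, stdev

def scale_gene(values):
    """Z-score one gene's expression values across samples."""
    mu = mean(values)
    sd = stdev(values)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        return [0.0] * len(values)  # constant gene: nothing to scale
    return [(v - mu) / sd for v in values]

scaled = scale_gene([2.0, 4.0, 6.0])  # made-up normalized counts
```

After scaling, a gene with counts in the thousands and a gene with counts in the tens are placed on the same scale, so neither dominates the heatmap colors.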
The size of the data to be plotted is shown under Data size:. To build the heatmap, click Build heatmap.
After clicking Build heatmap, the heatmap will be generated as shown below. In this example, the samples are colored by day at the top of the heatmap and the default cell coloring was used to plot gene expression. In addition, both samples and genes are clustered in this example.
You can also hover over points in the heatmap to identify the gene and normalized, scaled expression value.
To explore a specific gene and how it varies over a factor level, simply click a square within the heatmap to visualize normalized counts by factor in a box and whisker plot. In the example below, the factor selected was day under Choose factor. For the gene FAM111B, you can see the normalized count data for day 1 and day 20.
Select the number of Top variable genes and also select the minimum module size under Min. module size. The minimum module size is the smallest number of genes allowed in a module when the gene dendrogram is cut. Click on Launch clustering analysis.
The output includes the Dynamic tree cut, which is a color bar indicating the number and size of gene modules. A gene module is a set of genes with correlated expression across samples. The gene modules can be downloaded by clicking Download gene modules (CSV). The modules are identified and named by color.
Select the number of Top variable genes. Click on Launch clustering analysis.
When scRNA-Seq datasets are selected, there are two options for loading data. The first is to use the cluster identities provided in the metadata (Use pre-assigned clusters from metadata) and the second is to cluster the data at multiple resolutions (Run multiple resolutions using Seurat). When running multiple resolutions, the first page you will come to when exploring scRNA-Seq data is the Clustree tab.
In Select resolution, you will see a dimensional reduction plot, with the option of visualizing the clustered data by selected resolution. You will see the output from Clustree, which can guide you in your decisions to select the appropriate resolution for clustering.
Grouping factor.
The Overview tab displays summary plots of cluster and metadata quality.
Clusters: These plots show visualizations of clustering quality based on selected resolution or based on clustering in the metadata. More details are described below.
Under Clusters separation, you can select the number of DE genes per cluster compared to the closest cluster, the number of DE genes per cluster compared to all other clusters or the average silhouette width per cluster (select under Cluster separation metric). You can adjust the false discovery rate (FDR) for DE testing.
The Silhouette plot shows the silhouette widths and their averages per cluster. Ideal clustering should show a silhouette plot with primarily positive and minimal negative widths.
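For readers unfamiliar with silhouette widths, the standard definition is s = (b - a) / max(a, b), where a is a cell's mean distance to its own cluster and b is its mean distance to the nearest other cluster. The Python sketch below applies it to made-up one-dimensional points. It illustrates the definition only and is not the app's implementation.

```python
def silhouette_widths(points, labels):
    """Silhouette width per point for 1-D data with at least two clusters."""
    widths = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # a: mean distance to the other members of the point's own cluster
        own = [abs(p - q) for j, (q, l) in enumerate(zip(points, labels))
               if l == lab and j != i]
        if not own:
            widths.append(0.0)  # common convention for singleton clusters
            continue
        a = sum(own) / len(own)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(abs(p - q) for q, l in zip(points, labels) if l == other)
            / labels.count(other)
            for other in set(labels) if other != lab
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths

points = [0.0, 1.0, 10.0, 11.0, 0.5]
labels = ["A", "A", "B", "B", "B"]  # the last point lies among A but is labelled B
widths = silhouette_widths(points, labels)
```

The first four points get positive widths, while the last point gets a strongly negative width, which is the pattern that signals a poorly assigned cell in the Silhouette plot.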
Metadata: These plots show relationships between metadata variables.
Under Metadata relationships, relationships between metadata variables are displayed as either boxplots or scatterplots, depending on the format of the variables selected. For instance, cluster vs. nFeature_RNA (the number of genes) is displayed as a boxplot, while nFeature_RNA vs. nCount_RNA (total RNA counts) is displayed as a scatterplot. For scatterplots, boxplots are also displayed on the x- and y-axes reflecting the distribution of values for the x- and y-axis individually.
Metadata by cluster shows the summary of individual metadata variables by the selected Metadata. These are displayed as either boxplots or barplots, depending on the format of the variable (numeric and factor, respectively).
The Gene expression tab shows gene expression across clusters by box and whisker plot and dimensional reduction plot.
Gene expression by cluster allows you to select a particular gene and see its normalized expression by cluster using a box and whisker plot. We also include a dendrogram that reflects similarities among clusters based on the selected gene. Optional parameters are described below. There are two options when grouping by cluster: Include jitter and Include detection rate. Selection of both is the default. Jitter overlays individual gene expression values outside of the interquartile range (IQR). The detection rate is the percentage of cells expressing the gene in a given cluster (shown as dashes).
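The detection rate described above can be computed per cluster as in the following Python sketch. It assumes a cell "expresses" a gene when its normalized value is greater than zero. The values and the helper name are made up.

```python
def detection_rate_by_cluster(expression, clusters):
    """Percentage of cells per cluster with expression greater than zero."""
    totals, detected = {}, {}
    for value, cluster in zip(expression, clusters):
        totals[cluster] = totals.get(cluster, 0) + 1
        detected[cluster] = detected.get(cluster, 0) + (1 if value > 0 else 0)
    return {c: 100.0 * detected[c] / totals[c] for c in totals}

# Made-up normalized expression of one gene in eight cells.
expression = [0.0, 1.2, 3.4, 0.5, 0.0, 0.0, 2.1, 0.0]
clusters = ["0", "0", "0", "0", "1", "1", "1", "1"]
rates = detection_rate_by_cluster(expression, clusters)
```

Here three of four cells in cluster 0 express the gene, against one of four in cluster 1.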
Cell distribution of genes of interest allows you to select a particular gene and visualize normalized gene expression across clusters in a dimensional reduction plot. You can select among the following options:
Gene: From the drop-down menu, select your gene of interest to visualize.
Cell embedding: choose either PCA, UMAP, or tSNE to visualize the dimensional reduction plot.
x-axis: choose the dimension to visualize on the x-axis.
y-axis: choose the dimension to visualize on the y-axis.
Under Plot, you can select visualization by Gene expression overlay or Clusters. If you select Clusters, cells will be plotted in a dimensional reduction plot without showing gene expression.
If you check Include cluster labels, the cluster numbers will be overlaid above the clusters.
The DGE tab shows differential gene expression results visually in a dotplot and summarized in a table.
DGE by cluster allows you to select a specific cluster under Cluster, adjust the FDR and perform DE testing under Dotplot genes via DE vs rest or DE vs neighbor.
You can set the number of genes per cluster to show in the dotplot under # genes per cluster.
The DGE table presents statistics for DE genes with your set absolute log2(fold-change) (Abs. log2 fold-change) and adjusted p-value (Adj. p-value). Run differential gene expression by clicking Load table.
DGE table includes the following columns:
Custom DGE allows you to perform differential gene expression for all available factors in the metadata and not just cluster comparisons. As shown below, you can select the Factor from the metadata (e.g. orig.ident, day, treatment, etc.), which in this case is orig.ident. In addition, you can select the two groups to compare. As shown below, Group 1 is All (pairwise) while Group 2 is Rest. This will perform DE testing between each orig.ident group (e.g. Morulae, Zygote, etc.) and the rest of the groups. This will be the most time-consuming DGE test. You can also select a single comparison such as Morulae vs Zygote, or one specific comparison such as Morulae vs the rest of the groups.
Click View options. This will allow you to adjust the log2(fold-change) (LFC threshold), the Test you would like to use to perform DE testing, the minimum cut-off for gene detection (Min.pct) and the minimum difference in gene expression between Group 1 and Group 2.
After clicking Submit, the Differential gene expression results will produce a table including the following columns:
Group 1 in the Comparison drop-down menu
Group 2 in the Comparison drop-down menu
The Volcano plot tab compares two selected clusters or sets of cells using the selected plot types:
# top genes to label and the type of genes to label (e.g. Largest fold-changes)
Cluster A and Cluster B
Abs log2 fold-change and change the adjusted p-value (Adj. p-value). To build the table, click on Load table.
For the Heatmap on the scRNA-Seq side, there are three options for visualizing DEGs: Cluster DGE, Custom DGE and Manual.
For gene set enrichment (GSE), there are three options for performing GSE on DEGs: Cluster DGE, Custom DGE and Manual.
The Factor and Contrast selected under Custom DGE contrasts will be auto-selected to perform enrichment analysis.
We have enabled you to select cells manually or by filtering to perform custom DE testing. You can perform DGE and GSE on your selections.
x-axis and y-axis
Metadata overlay and filtering.
Click the +Add button, then select the cells by the selected factor under Select cells by <factor>.
Click +Set A: Add cells. You will see the total number of cells added above +Set A: Add cells.
For set B, select cells under Select cells by <factor>. Then click on +Set B: Add cells. You will see the total number of cells added above +Set B: Add cells.
To remove cells, click -Remove (step 1 shown below). To remove cells from set A, click on -Set A: Remove cells. To remove cells from set B, click on -Set B: Remove cells.
Click +Set A: Add cells to include the selected cells in set A (step 2 below).
Before clicking Calculate differential gene expression, you must include a name for this comparison under Short name for this comparison.
You can set the log2 fold-change and adjusted p-value cut-offs under Abs. log2 fold-change and Adj. p-value, respectively. Then click on Download DGE data to download your DE testing data.
You can view your selections in the Heatmap as described in the heatmap section.
To perform GSE on your selections, go to the Gene set enrichment tab and select Set A-Set B under Cluster contrast as shown below.
If you are not completely satisfied with the clustering provided in the metadata or after running a variety of resolutions, you can combine clusters under Merge clusters.
Under Select clusters to combine, either select or type in the clusters you would like to combine.
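Conceptually, merging clusters relabels every cell in the chosen clusters with one shared label and leaves the other cells untouched. A minimal Python sketch follows. The label format "1_3" and the helper are hypothetical, not the app's naming scheme.

```python
def merge_clusters(cluster_labels, to_merge, new_label):
    """Relabel every cell whose cluster is in to_merge with new_label."""
    to_merge = set(to_merge)
    return [new_label if c in to_merge else c for c in cluster_labels]

# Made-up per-cell cluster assignments.
cell_clusters = ["0", "1", "2", "3", "1"]
merged = merge_clusters(cell_clusters, to_merge=["1", "3"], new_label="1_3")
```

The original labels are not modified, which mirrors the way the updated metadata is saved as a separate table.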
Enter your username under Username (step 1 below). You can also include any notes about the cluster combination under Comment (step 2 below). The Updated table name will be automatically updated, so you do not need to add anything in that box. Lastly, click on Save to database to save your updated metadata in the RDS (step 3 below). This updated metadata can be loaded as described above under Option: Select user-updated metadata.
To group cells by gene expression, choose your target gene under Select genes. You can either type in the gene name or scroll through the list to select your target gene. To begin creating clusters, type in the name you would like to use under New set name and click on +New set to create a set name, or keep it labeled as Set 1, etc.
+, click and hold down your mouse while using the lasso tool to draw a dotted line around your selected cells as shown below in step 1. Next, click Add to <set_name> as shown in step 2. Lastly, you will see how many cells will be included in that set as shown below in step 3.
Select genes. You can specify how you would like to visualize the gene expression (i.e. mean, median or sum) under Summary metric.
Enter your username under Username (step 1). Optionally, add a comment about your clustering schema under Comment (step 2). Please do not type anything into Updated table name (step 3). Lastly, click on Save to database to store your updated metadata in the RDS (step 4).
Add to <set name> as shown below in steps 1 - 4. In this example, we selected ranges 0 - 1 for Cluster 1. This can be iteratively selected for different ranges and added to each additional set.
If inclusive is checked, the range includes the minimum or maximum value. If the checkbox is not checked, the range includes all values greater than From (min <number>) and/or all values less than From (max <number>).
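The difference between an inclusive and a non-inclusive range can be sketched in Python as follows. The function and parameter names are ours, chosen for illustration, and the expression values are made up.

```python
def in_range(value, lower=None, upper=None, inclusive=True):
    """True if value lies within the bounds; None means no bound on that side."""
    if lower is not None:
        if value < lower or (not inclusive and value == lower):
            return False
    if upper is not None:
        if value > upper or (not inclusive and value == upper):
            return False
    return True

expression = [0.0, 0.5, 1.0, 1.5]  # made-up expression values
with_bounds = [v for v in expression if in_range(v, 0, 1, inclusive=True)]
without_bounds = [v for v in expression if in_range(v, 0, 1, inclusive=False)]
```

With the range 0 - 1, the inclusive selection keeps cells at exactly 0 and 1, while the non-inclusive selection keeps only the cell at 0.5.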
The iPSC profiler enables visualization of the expression of gene modules. A gene module is a set of genes that make up, for example, a gene expression pathway. Expression, as it is defined here, is quantified by a module score, which is a measure of the expression of genes in the module compared to a random set of genes.
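The idea behind a module score can be shown with a deliberately simplified Python sketch: the mean expression of the module genes minus the mean expression of a random control set. Established implementations such as Seurat's AddModuleScore draw control genes from expression-matched bins, which this sketch omits. Whether the app uses that function is an assumption, and the gene names are made up.

```python
import random
from statistics import mean

def module_score(cell_expression, module_genes, seed=0):
    """Mean module expression minus mean expression of random control genes."""
    module = [g for g in module_genes if g in cell_expression]
    background = sorted(g for g in cell_expression if g not in module)
    rng = random.Random(seed)  # fixed seed keeps the control set reproducible
    control = rng.sample(background, min(len(module), len(background)))
    return (mean(cell_expression[g] for g in module)
            - mean(cell_expression[g] for g in control))

# One made-up cell: two module genes are high, the background genes are flat.
cell = {"GENE_1": 5.0, "GENE_2": 3.0, "GENE_3": 1.0, "GENE_4": 1.0, "GENE_5": 1.0}
score = module_score(cell, ["GENE_1", "GENE_2"])
```

A positive score means the module genes are expressed above the background in that cell, and a score near zero means the module is no more active than a random gene set.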
The heatmap shows the module scores across all cells by module. By default, module scores from all available modules are shown. Optional parameters for the heatmap are described below.
Grouping factor: select the factor used to group cells (i.e. the colorbar above the heatmap). This comes from the metadata.
Color palette: the options for coloring your heatmap cells by normalized expression value include: default (red-white-blue), red-blue (darker red-white-blue), viridis (blue-green-yellow) and green-yellow-red.
Use all modules: if this is checked, all gene modules will be included in the heatmap. If you would like to only view select module(s), uncheck the checkbox Use all modules.
Then choose the module(s) to display under Select profiles. A heatmap will be generated automatically.
Module scores are projected onto either PCA, UMAP or tSNE dimensional reduction plots in the left plot. The dimensional reduction plot is replicated in the right plot, colored according to the factor selected.
Module.
Grouping factor.
Module scores for all cells are visualized as violin plots, with cells grouped based on a chosen factor.
Grouping factor.
Module.
The More tab includes this tutorial under Tutorial, frequently asked questions under FAQ, information about the group that created this web application under About us and, lastly, information about the R session under Session info.